Inferring speakers' physical attributes from their voices

Authors

  • Robert M. Krauss
  • Robin Freyberg
  • Ezequiel Morsella
Abstract

Two experiments examined listeners' ability to make accurate inferences about speakers from the nonlinguistic content of their speech. In Experiment I, naïve listeners heard male and female speakers articulating two test sentences, and tried to select which of a pair of photographs depicted the speaker. On average they selected the correct photo 76.5% of the time. All performed at a level that was reliably better than chance. In Experiment II, judges heard the test sentences and estimated the speakers' age, height, and weight. A comparison group made the same estimates from photographs of the speakers. Although estimates made from photos are more accurate than those made from voice, for age and height the differences are quite small in magnitude: a little more than a year in age and less than a half inch in height. When judgments are pooled, estimates made from photos are not uniformly superior to those made from voices. © 2002 Elsevier Science (USA). All rights reserved.

Most people have had the experience of seeing for the first time a speaker whose voice is familiar (from telephone conversations, the radio, etc.), and being surprised by that person's appearance. The fact that people are surprised in such situations suggests they expect their mental images of speakers to have some degree of verisimilitude. To what extent are such expectations justified? More generally, what do we know about the inferences listeners make from speakers' voices?

It has long been known that, quite apart from what is said, a speaker's voice conveys considerable information about the speaker, and that listeners utilize this information in evaluations and attributions. Giles and Powsland (1975) provide a useful (albeit now somewhat outdated) review of research on this topic. Perhaps the most familiar example of how listeners spontaneously use variations in speakers' voices is the biasing effect of dialects associated with social class.
Status variation in language use occurs in most societies (Guy, 1988), and it is remarkable how accurately naïve listeners can utilize these variations to identify a speaker's socioeconomic status (SES). Judgments of SES based on hearing speakers read a brief standard passage are highly correlated with measured SES, and even so minimal a speech sample as counting from 1 to 10 yields reasonably accurate judgments (Ellis, 1967). Lower (and working) class speakers tend to be judged less favorably than middle-class speakers (Smedley & Bayton, 1978; Triandis & Triandis, 1960), and middle-class judges perceive themselves to be more similar to middle-class speakers than to lower-class speakers (Dienstbier, 1972).

One might expect that research on the inferences listeners make from speech would be part of the study of speech perception, but for interesting reasons that is not the case. For speech perception researchers, the fundamental issue has been one that is common to all psychological studies of perception: constancy. Spoken language shows variability in its realization, but stability in its perception, and the primary goal of speech perception research is to explain how this is accomplished, that is, how a perceiver arrives at a stable percept from a highly variable stimulus. Goldinger makes the point with regard to word recognition: "Most theories of spoken word identification assume that variable speech signals are matched to canonical representations in memory. To achieve this, idiosyncratic voice details are first normalized, allowing direct comparison of the input to the lexicon" (Goldinger, 1995, p. 1166).

Journal of Experimental Social Psychology 38 (2002) 618–625. www.academicpress.com. Corresponding author: R.M. Krauss. Fax: +1-212-854-3609. E-mail address: [email protected]. PII: S0022-1031(02)00510-3.

Comprehending speech requires the hearer to distinguish between variability in the acoustic signal that is linguistically significant (i.e., that contributes to comprehension of the utterance's intended meaning) and variability that is not. A great deal of the variability found in speech does not contribute to comprehension, while at the same time tokens of the same linguistic type (that must be perceived as equivalent for purposes of comprehension) can differ markedly in their realization. Some of this variability is the result of language-specific coarticulation rules and typically goes unnoticed by the listener, but some of it reflects important attributes of the speaker that can serve as a basis for inferences about his or her identity, attitude, emotional state, definition of the situation, etc.

For example, systematic variation in the articulation of certain phonemes distinguishes dialects and accents. Dialects are associated with speech communities, and reflect regional origin and SES. Stereotypes associated with the speech communities (Southerners are stupid, New Yorkers are venal and rude, poor people are lazy) affect the way the speaker's behavior is perceived (Giles & Powsland, 1975).

Variation in fundamental frequency (F0), amplitude, rate and fluency may be related to momentary changes in the speaker's internal state. The most intensively investigated of these internal states is affective arousal. F0, amplitude and syllabic rate increase, and fluency decreases, when arousal is high (Hecker, Stevens, von Bismarck, & Williams, 1968; Streeter, Krauss, Geller, Olson, & Apple, 1977; Streeter, Macdonald, Apple, Krauss, & Galotti, 1983; Williams & Stevens, 1972), but it is likely that finer distinctions could be made.

Anatomical differences constitute another source of variability.
Speakers' vocal tracts differ, and each produces a signal that is acoustically distinctive, although the audible differences between any pair of voices may be small and not readily discernible. Gross differences in the vocal tract are related to inter-individual differences on a number of personal attributes. Perhaps the most familiar is age. The physiological changes that mark the progression from infant to toddler to adolescent to adult are paralleled by striking changes in voice quality; only slightly less familiar are the vocal changes that accompany the transition from adulthood to old age (Caruso, Mueller, & Shadden, 1995; Ramig, 1986; Ramig & Ringel, 1983). Anatomy also accounts for some of the difference among the voices of speakers of the same age. Just as children's voices deepen as their size increases, adult speakers who are large tend to have lower, more resonant voices than speakers who are small, although the correlation is far from perfect. In all likelihood there are other acoustic correlates of size and physique, although they are not uncomplicated.

Several investigators have reported relationships between naïve listeners' estimates from voice samples of such attributes as age, height, and weight and the actual values (Allport & Cantril, 1934; Lass & Colt, 1980; Lass & Davis, 1976; van Dommelen, 1993). Unfortunately, differences in method, sample characteristics and measures make it difficult to reach general conclusions about how accurate naïve listeners' estimates are. In the typical study, a relatively large number of listeners hear samples of speakers' voices, and estimate each speaker's age (or some other attribute). The mean estimate for each speaker is calculated, and the average difference between the mean of the estimated ages and the actual ages is used as a measure of accuracy.
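To make the accuracy measure just described concrete, the following is a minimal sketch in Python. It is not the procedure of any cited study: the speakers, ages, and estimates are invented for illustration, and the function names `pooled_error` and `individual_error` are hypothetical. It computes the pooled measure (the error of each speaker's mean estimate) next to the average error of single estimates, so the two quantities can be compared.

```python
from statistics import mean

# Hypothetical data: actual ages of three speakers and four judges'
# estimates for each speaker. All numbers are invented for illustration.
actual_ages = {"speaker_a": 25, "speaker_b": 40, "speaker_c": 60}
estimates = {
    "speaker_a": [20, 30, 22, 28],
    "speaker_b": [30, 50, 35, 45],
    "speaker_c": [50, 70, 55, 65],
}


def pooled_error(actual, judged):
    """Average absolute gap between each speaker's mean estimate and the actual value."""
    return mean(abs(mean(judged[s]) - actual[s]) for s in actual)


def individual_error(actual, judged):
    """Average absolute error of single estimates, over all judges and speakers."""
    return mean(abs(e - actual[s]) for s in actual for e in judged[s])


print("pooled error:", pooled_error(actual_ages, estimates))
print("individual error:", individual_error(actual_ages, estimates))
```

With these invented numbers, over- and underestimates cancel within each speaker, so the pooled error is 0 while single estimates miss by about 6.3 years on average. The example is deliberately extreme, but it shows why a small pooled error need not imply accurate individual judges.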
Although such statistics are often presented as an index of people's accuracy in estimating age from voice, what they really reflect is the accuracy of judges' pooled estimates. For example, Lass and Colt (1980) reported a mean difference between a speaker's actual height and height estimated from voice to be 1.4 in. for female speakers and 0.49 in. for male speakers. These values represent the difference between the mean of judges' estimates of speakers' heights and the mean actual height in the sample of speakers, and tell us little about how accurately the height of an individual speaker is likely to be estimated by the average judge.

Nearly all of the previous studies have used speech samples drawn from college populations, which restricts the range of such variables as age. In the experiments reported here, we took pains to obtain a more heterogeneous sample of speakers. Using this sample, we examined the ability of listeners to match speakers' pictures to their voices and to estimate speakers' physical attributes from their voices. In Experiment I, naïve listeners heard speakers reading standard test sentences, and then saw a pair of pictures. Their task was to identify the picture of the speaker. In Experiment II, judges heard the test sentences and estimated the speaker's age, height, and weight. For comparison purposes, another set of judges made the same estimates from photographs of the speakers.

Experiment 1. Speaker identification

Similar resources

Estimation of Children's Physical Characteristics from Their Voices

To date, multiple strategies have been proposed for the estimation of speakers’ physical parameters such as height, weight, age, gender etc. from their voices. These employ various types of feature measurements in conjunction with different regression and classification mechanisms. While some are quite effective for adults, they are not so for children’s voices. This is presumably because in ch...

Perceptual scaling of voice identity: common dimensions for different vowels and speakers.

The aims of our study were (1) to determine if the acoustical parameters used by normal subjects to discriminate between different speakers vary when comparisons are made between pairs of two of the same or different vowels, and if they are different for male and female voices; (2) to ask whether individual voices can reasonably be represented as points in a low-dimensional perceptual space suc...

Prosodic predictors of upcoming positive or negative content in spoken messages.

This article examines potential prosodic predictors of emotional speech in utterances perceived as conveying that good or bad news is about to be delivered. Speakers were asked to call an experimental confederate to inform her about whether or not she had been given a job she had applied for. A perception study was then performed in which initial fragments of the recorded utterances, not contai...

Building personalised synthetic voices for individuals with severe speech impairment

For individuals with severe speech impairment accurate spoken communication can be difficult and require considerable effort. Some may choose to use a voice output communication aid (or VOCA) to support their spoken communication needs. A VOCA typically takes input from the user through a keyboard or switch-based interface and produces spoken output using either synthesised or recorded speech. ...

How trustworthy is your voice? The effects of voice manipulation on the perceived trustworthiness of novel speakers

A person's voice is not only loaded with cues to age, sex and emotional state, listeners also readily form personality impressions of novel speakers. Based ...

Publication date: 2002